The availability of large-scale multitasked parallel architectures introduces the following
processor assignment problem for pipelined computations. Given a set of tasks and their
precedence constraints, along with their experimentally determined individual response times
for different processor sizes, find an assignment of processors to tasks. Two objectives interest
us: minimal response time given a throughput requirement, and maximal throughput given a
response time requirement. These assignment problems differ considerably from the classical
mapping problem in which several tasks share a processor; instead, we assume that a large
number of processors are to be assigned to a relatively small number of tasks. In this paper
we develop efficient assignment algorithms for different classes of task structures. For a
$p$-processor system and a series-parallel precedence graph with $n$ constituent tasks, we provide
an $O(np^2)$ algorithm that finds the optimal assignment for the response time optimization
problem; we find the assignment optimizing the constrained throughput in $O(np^2 \log p)$ time.
Special cases of linear, independent, and tree graphs are also considered. In addition, we
examine more efficient algorithms when certain restrictions are placed on the problem
parameters. Our techniques are applied to a task system in computer vision.


*Research was supported by the National Aeronautics and Space Administration under NASA Contract
No. NAS1-18605 while the author was in residence at the Institute for Computer Applications in Science
and Engineering, NASA Langley Research Center, Hampton, VA 23665-5225.
1  Introduction


In recent years much research has been devoted to the problem of mapping large computations onto
a system of parallel processors. Various aspects of the general problem have been studied, including
different parallel architectures, task structures, communication issues and load balancing [11, 16].
Typically, experimentally observed performance (e.g., speedup or response time) is tabulated as a
function of the number of processors employed. We are particularly interested in tabulations of
response time, which we will refer to as response-time functions. Our work is also motivated by the
growing availability of multitasked parallel architectures, such as PASM [37], the NCube system
[18], and Intel’s iPSC system [7], in which it is possible to map tasks to processors and allow parallel
execution of multiple tasks in different logical partitions.

In this paper, we consider the problem of optimizing performance of a task structure on a
parallel architecture, given a large supply of processors, and the experimentally determined response
time functions for its constituent tasks. The task structure describes the sequencing of various
computational activities (tasks) that are to be applied to each of many data sets; the data sets
themselves are pipelined through the task structure. We refer to this class of computations as
pipeline computations. This problem arises in data parallel applications such as the computer
vision example we consider in this paper, when individual tasks, e.g. a fast Fourier transform,
are highly parallelizable. Unlike prior treatments of the mapping problem, we are interested in
the case where there are many more processors than tasks. Rather than ask which tasks must
share a processor, we ask how many processors each task should be allocated. We are interested
in both the response time of the task structure on one data set, and in the throughput (data sets
processed per unit time). We consider the dual problems of minimizing response time subject to a
throughput constraint, and maximizing throughput subject to a response time constraint. These
problems are complementary, in the sense that allocation to increase throughput may have the side
effect of increasing response time, and vice versa.

Under the assumption that the constituent task response time functions completely characterize
performance, we show that $p$ processors can be optimally allocated to an $n$-node series-parallel task
structure in $O(np^2)$ time. We study separately the special cases of linear and tree structures and
show an $O(np^2)$ procedure; we also consider response time function characteristics such as convexity
which are exploited to achieve even more efficient algorithms. Our methods are applied to the task
of motion estimation in a computer vision system; we present several experimental results for both
the response time and the throughput problem.

The problem of mapping workload to processors has attracted a great deal of attention in
the literature, leading to a number of problem formulations. One often views the computation in
terms of a graph, where nodes represent computations and edges represent communication; for an
example, see [2]. In this case, mapping means assigning each node (task) to a processor. One view
of the mapping problem is that the computation graph represents a distributed program, with a
serial thread of control. Tasks have different affinities for different heterogeneous processors; the
problem is to assign tasks to processors so that the total sum of execution times (of all tasks)
and communication costs is minimized. Fundamental contributions to this problem are made in
[4, 39, 41]. However, the objective function for this problem does not capture any parallelism among
the tasks. Another mapping problem formulation views the architecture as a graph whose nodes
are processors and whose edges identify processors able to communicate directly. The dilation
of a computation graph edge $(u, v)$ is the minimum distance (in the processor graph) between
the processors to which $u$ and $v$ are respectively assigned. The dilation of the graph itself is the
maximum dilation among all computation graph edges. Dilation is a measure of how well the
mapping preserves locality between nodes in the mapped computation graph. Results concerning
the minimization of dilation can be found in [8, 19, 32, 36], and their references. Yet another
formulation directly models execution time of a data parallel computation as a function of the
chosen mapping, and attempts to find a mapping that minimizes the execution time. Workload
may again be represented as a graph, with edges representing data communication. Nodes are
mapped to processors in such a way that each processor’s workload is approximately the same; for
example, see [1, 5, 24, 33, 35]. Formulations using simulated annealing or neural networks attempt
to minimize an “energy” function that heuristically quantifies the cost of the partition [6, 17].
Other interesting formulations consider mapping highly structured computations onto pipelined
multiprocessors [25], and mapping systolic algorithms onto hypercubes [22]. The problem we study
is distinctly different from these, in that it seeks the assignment of multiple processors to a task,
rather than multiple tasks to a processor.

Recently, some studies have considered the scheduling of tasks on multitasked parallel architectures
where each task can be assigned a set of processors. The objective in such work, for example
in [3, 13, 27], is to find a schedule that minimizes completion time. A fundamental difference
between the processor assignment problem studied in this paper and the above scheduling problems
is that scheduling formulations allow tasks to be queued or sequenced. In contrast, the nature
of pipeline computations recommends assigning at least one processor to each task: executable
images, which would be swapped into main memory for each data set under scheduling, would
remain in main memory under our assignment formulation. The problem of assigning processors
to a set of independent tasks where each task is a chain of modules is considered in [10]. This
differs from our problem, as neither response-time functions nor task precedence is treated. In
other formulations, each task requires a specific number of processors; in this case, the problem of
scheduling tasks on a partitionable hypercube or mesh-connected architecture has been studied
[9, 14, 23, 29]. Pipeline computations are studied in [25, 38]. In [38], heuristics are given for
scheduling planar acyclic task structures and in [25], a methodology is presented for analyzing
pipeline computations using Petri nets together with techniques for partitioning computations. We
have not discovered treatments that address optimal processor assignment to pipeline computations,
although our solution approach (dynamic programming) is related to those in [4] and [41].

This paper is organized as follows. Section 2 introduces notation, and formalizes the response-time
problem and the throughput problem. Section 3 develops some preliminary results about
response time functions that will be used throughout the paper. Section 4 closely examines two
response-time problems associated with linear arrays of tasks, and Section 5 applies these results to
tasks structured as trees or more general series-parallel graphs. Section 6 shows how the problem
of maximizing throughput subject to a response-time constraint can be solved using solutions to
the response-time problem. Section 7 discusses application of our techniques to actual problems,
and Section 8 summarizes this work.

              Number of processors
Task      1      2      3      4      5      6      7      8
t1       29     16      ?      9      ?      6    4.5      4
t2       40     21     14     11    8.5      8      ?      5
t3       10    5.5    3.4      3    2.5      2    1.5      2
t4       20     12     10      9      8      7      6      5
t5       15     10      8      5      4    3.5      3    2.5

Table 1: Example of response time functions

2  Problem  Definition 

A pipeline computation is a quadruple $\mathcal{P} = \langle K, T, F, G \rangle$ where

•  $K = \{1, \ldots, p\}$ is a set of identical processors.

•  $T = \{t_1, \ldots, t_{n+1}\}$ is a set of tasks labeled such that $t_1$ is always the first task and
$t_{n+1}$ the last task executed on each data set. We will assume that the last task $t_{n+1}$ is a
“dummy” task that requires no processing; it is used for convenience of notation in the graph $G$,
described below.

•  $F = \{f_1, \ldots, f_{n+1}\}$ is a collection of response-time functions $f_i : K \to \mathbb{R}^+$ for each task. For
notational convenience we assume that $f_i(0) = \infty$ for all $i = 1, \ldots, n$. We also assume that
$f_{n+1}(x) = 0$ for all $x$, so that no processors need ever be assigned to the dummy task. It
is often convenient to think of the discrete function $f_i$ as a table, a format we shall use in
this paper. Later, we will also use $F$ to denote the response time functions for a whole task
structure.

•  $G = (T, E)$ is a directed acyclic graph (DAG) describing the precedence relation for the tasks
in $T$. Thus, $(t_i, t_j) \in E$ if $t_i$ immediately precedes $t_j$.

An example of a response time table for $n = 5$ and $p = 8$ is shown in Table 1. Each row of the
table is a response time function for a particular task. In the course of the paper we will be
constructing examples to demonstrate the use of our algorithms for various graph structures; these
examples will use the response time functions in this table.

Our definition of a pipeline computation extends earlier definitions [25, 38] to include the
empirically determined response-time functions. Observe that $f_i(k)$ may include the communication
costs inherent in executing $t_i$ on $k$ processors, as well as the communication costs $t_i$ may suffer
communicating with predecessor and/or successor tasks in $T$. This paper assumes that all
performance dependencies on communication are captured in the response time functions. Our problem
formulation therefore does not attempt to deal with any issues related to “matching” the task
structure topology to the architecture topology. It implicitly assumes that performance is
independent of which processors are assigned to a task. These assumptions are reasonable when the cost
of communication is largely independent of the distance between communicating processors (as is
the case with the Intel iPSC/2 [7]), and the communication bandwidth is sufficiently high for us to
ignore effects due to contention between pairs of communicating tasks. They are also reasonable
for compute-bound applications, for which load balancing of the type we study is a major concern.
The computer vision application we later consider is compute-bound.

Let $A : T \to \mathbb{Z}$ denote a feasible assignment of processors to tasks such that $\sum_{i=1}^{n} A(t_i) \le p$
and $A(t_i) \ge 1$ for all $t_i$ where $1 \le i \le n$. Observe that we do not require all $p$ processors to be assigned,
as it is possible that increasing the number of processors used actually hampers performance. In
addition, observe that each task must be assigned at least one processor; this condition clearly
differentiates between an assignment and a schedule.

For a pipeline computation $\mathcal{P}$ and assignment (mapping) $A$, define the following:

•  $S(\mathcal{P}, A) = \max_{1 \le i \le n} f_i(A(t_i))$, the largest response time, under $A$, among all tasks.

•  $\Lambda(\mathcal{P}, A) = S(\mathcal{P}, A)^{-1}$. We will later argue that this quantity is the maximal throughput
under assignment $A$, i.e., the maximum rate at which successive data sets can be processed
by the task system.

•  $L = \{ l \mid l$ is a path in $G$ starting from $t_1$ and ending in $t_{n+1} \}$. $L$ is thus the set of all complete
paths through $G$. We will write each $l \in L$ as a set $\{t_{i_1}, \ldots, t_{i_u}\}$, $i_1 = 1$, $i_u = n+1$, $1 \le i_j \le n+1$,
with $l$ consisting of the edges $(t_{i_1}, t_{i_2}), \ldots, (t_{i_{u-1}}, t_{i_u})$.

•  $R(\mathcal{P}, A) = \max_{l \in L} \sum_{t_i \in l} f_i(A(t_i))$, the “length” of the longest path through $G$. $R(\mathcal{P}, A)$ is
thus the total time required to execute one data set, i.e., the response time.
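
The three quantities just defined can be evaluated mechanically for any given assignment. The following Python sketch is only an illustration: it uses the rows for t3, t4 and t5 of Table 1, but the precedence graph (t3 feeding t4 and t5, which both feed a dummy sink) and all function names are assumptions made for this example, not part of the paper.

```python
# Response-time tables from rows t3, t4, t5 of Table 1;
# entry k-1 is the response time on k processors.
F = {
    "t3": [10, 5.5, 3.4, 3, 2.5, 2, 1.5, 2],
    "t4": [20, 12, 10, 9, 8, 7, 6, 5],
    "t5": [15, 10, 8, 5, 4, 3.5, 3, 2.5],
}

# Hypothetical precedence graph (an assumption, not from the paper):
# t3 feeds t4 and t5, and both feed a dummy sink "end" needing no processing.
SUCCESSORS = {"t3": ["t4", "t5"], "t4": ["end"], "t5": ["end"], "end": []}


def task_time(task, assignment):
    """Response time of one task under the assignment; the dummy sink costs 0."""
    if task == "end":
        return 0.0
    return F[task][assignment[task] - 1]


def bottleneck(assignment):
    """S(P, A): the largest response time among all real tasks."""
    return max(task_time(t, assignment) for t in F)


def throughput(assignment):
    """Lambda(P, A) = 1 / S(P, A)."""
    return 1.0 / bottleneck(assignment)


def response_time(assignment, task="t3"):
    """R(P, A): length of the longest path from `task` to the sink."""
    tails = [response_time(assignment, s) for s in SUCCESSORS[task]]
    return task_time(task, assignment) + (max(tails) if tails else 0.0)


A = {"t3": 2, "t4": 3, "t5": 3}  # uses 8 processors in total
```

Under these assumptions, `bottleneck(A)` is 10, so the throughput is 0.1 data sets per unit time, while `response_time(A)` is 15.5 because the longest path runs through t3 and t4.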

With  these  definitions  we  formulate  two  problems. 

Response  time  problem: 

Given a pipeline computation $\mathcal{P}$ and throughput requirement $\lambda$, find an assignment
$A^*$ such that $\Lambda(\mathcal{P}, A^*) \ge \lambda$, and $R(\mathcal{P}, A^*) \le R(\mathcal{P}, A)$ for every feasible assignment
$A$ which satisfies $\Lambda(\mathcal{P}, A) \ge \lambda$.

We are also interested in determining how the optimal response time $R(\mathcal{P}, A^*)$ behaves as a
function of $p$, the maximum number of available processors. In other words, we are interested
in obtaining the response time function for the entire computation $\mathcal{P}$: the values of $R(\mathcal{P}, A^*)$ for
different values of $p$. We will call this $\mathcal{P}$'s optimal response time function, or sometimes simply the
response time function (the optimality being understood).
Throughput problem:

Given a pipeline computation $\mathcal{P}$ and response time requirement $\rho$, find an assignment
$A^*$ such that $R(\mathcal{P}, A^*) \le \rho$, and $\Lambda(\mathcal{P}, A^*) \ge \Lambda(\mathcal{P}, A)$ for every feasible assignment
$A$ which satisfies $R(\mathcal{P}, A) \le \rho$.

The response time problem arises when we have a steady stream of input data arriving at a fixed
rate and the system must complete processing each data set as soon as possible. The throughput
problem arises when there is flexibility in the amount of time it takes to process one data set
but the throughput must be maximized to handle high input data rates. Both conditions appear
in real-time applications. Our approach will be to focus first on the response-time problem, for
different task structures; in Section 6 we then show how solutions to the response time problem
can be used to solve the throughput problem.
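
To make the two problem statements concrete, the sketch below solves both by exhaustive enumeration on a tiny instance. It is not one of the paper's algorithms: it assumes, purely for illustration, that tasks t3, t4 and t5 of Table 1 form a three-stage series pipeline on 8 processors, and the function names are hypothetical.

```python
from itertools import product

# Rows t3, t4, t5 of Table 1; entry k-1 is the response time on k processors.
TABLE = [
    [10, 5.5, 3.4, 3, 2.5, 2, 1.5, 2],
    [20, 12, 10, 9, 8, 7, 6, 5],
    [15, 10, 8, 5, 4, 3.5, 3, 2.5],
]
P = 8      # available processors
EPS = 1e-12  # tolerance for floating-point comparisons


def feasible_assignments(p=P):
    """All assignments giving each task at least 1 processor and at most p in total."""
    for a in product(range(1, p + 1), repeat=len(TABLE)):
        if sum(a) <= p:
            yield a


def times(a):
    return [TABLE[i][k - 1] for i, k in enumerate(a)]


def response_time(a):
    """R for a series pipeline (assumed structure): the sum of the stage times."""
    return sum(times(a))


def bottleneck(a):
    """S: the slowest stage; the throughput is its inverse."""
    return max(times(a))


def solve_response_time(lam, p=P):
    """Minimize response time subject to throughput >= lam."""
    cands = [a for a in feasible_assignments(p) if bottleneck(a) * lam <= 1 + EPS]
    return min(cands, key=response_time) if cands else None


def solve_throughput(rho, p=P):
    """Maximize throughput (minimize the bottleneck) subject to response time <= rho."""
    cands = [a for a in feasible_assignments(p) if response_time(a) <= rho + EPS]
    return min(cands, key=bottleneck) if cands else None
```

On this instance, a throughput requirement of 0.1 yields the assignment (3, 3, 2), whereas a loose requirement of 0.01 yields (2, 2, 4), which has a lower response time but a slower bottleneck stage. This is the trade-off between the two objectives described above; the enumeration is exponential in the number of tasks and serves only as a reference point.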

3  Preliminaries 

Much of this paper is devoted to the issue of decomposing a large task structure into a set of smaller
task structures and constructing a response time function for the large structure from response time
functions for the smaller structures. This is accomplished by first separately studying algorithms
for handling simple task structures such as tasks in series and tasks in parallel. Then more complex
task structures such as trees and series-parallel graphs are treated by decomposing the optimization
procedure to handle series and parallel components of the overall task structure.

Given $x$ ($x \le p$) processors and a task structure consisting only of two tasks $t_1, t_2$, with response
time functions $f_1, f_2$, we wish to determine $y$ such that assigning $y$ processors to $t_1$ and $x - y$ to
$t_2$ satisfies the throughput requirement and minimizes the overall response time. If we tabulate
this minimal response time for each value of $x$, then we obtain a response time function for the
aggregate of $t_1$ and $t_2$. Note that this function captures optimality and is thus an optimal response
time function. In general, given a set of task structures $\{\mathcal{P}_1, \ldots, \mathcal{P}_m\}$, where for $j = 1, \ldots, m$,
$\mathcal{P}_j = \langle K, T_j, F_j, G_j \rangle$, we extend the notion of response time function for a single task to a response time
function for an entire pipeline computation: let $F_j : \mathbb{Z} \to \mathbb{R}$ be the response time function for $\mathcal{P}_j$,
i.e., $F_j(x)$ is the optimal response time achieved for $\mathcal{P}_j$ using $x$ processors. Suppose also that we have
an $m$-node graph $G$ that describes a precedence relation on $\{\mathcal{P}_1, \ldots, \mathcal{P}_m\}$. We may view each $\mathcal{P}_j$ as
an arbitrary task, even though $\mathcal{P}_j$ may itself have a complex subtask structure. We wish to construct
the optimal response time function for the structure $Q = \langle K, \{\mathcal{P}_1, \ldots, \mathcal{P}_m\}, \bigcup_{j=1}^{m} \{F_j\}, G \rangle$, given a
throughput constraint $\lambda$. We accomplish this by solving a number of response-time problems: for
every $x \in [1, p]$ processors, we determine the minimal response time $h(x)$ achievable by allocating no
more than $x$ processors among the task structures $\mathcal{P}_j$ in such a way that the throughput requirement
is satisfied. $h(x)$ becomes the optimal response time function for $Q$, which now can be treated as
a task itself with a known response-time function.

We are interested in properties of optimal response time functions that are conserved through
such an aggregation procedure. Two questions are particularly important: (i) what is the minimum
number of processors needed for $Q$ to meet the throughput constraint, and (ii) what is the maximum
number of processors that $Q$ should be allocated? The answer to the first question is straightforward
whereas the answer to the second requires additional analysis.

First consider the throughput constraint question. Let $u_\lambda(\mathcal{P}_j)$ denote the minimum number of
processors $\mathcal{P}_j$ must be allocated in order to meet throughput constraint $\lambda$. For a single task $t_i$,
$u_\lambda(t_i)$ denotes the minimum that must be assigned to task $t_i$, i.e., $u_\lambda(t_i) = \min_{x \in \mathbb{Z}} \{ x : f_i(x) \le 1/\lambda \}$.
Observe that any distribution of processors to $Q$ must assign at least $u_\lambda(\mathcal{P}_j)$ processors to $\mathcal{P}_j$ if $Q$ is to
meet the throughput requirement. As this is true for each $\mathcal{P}_j$, it is clear that

$$u_\lambda(Q) \ge \sum_{j=1}^{m} u_\lambda(\mathcal{P}_j). \qquad (1)$$

This is true regardless of the structure of $Q$. It is also true that if every $\mathcal{P}_j$ is allocated $u_\lambda(\mathcal{P}_j)$
processors, then $Q$'s throughput is at least $\lambda$. One need only perform an easy induction on the
number of nodes in the precedence graph to establish that $Q$'s throughput is the inverse of the
maximal response time among all tasks in $Q$. This shows that the inequality in equation (1) can be
reversed, thereby implying equality. Thus, the rule for computing minimal processor requirements
for $Q$ is simple, and general: add the minimal requirements of $Q$’s constituent tasks.
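
The rule above translates into a few lines of code. The sketch below is a minimal illustration using rows t3, t4 and t5 of Table 1; the function names are hypothetical, and a small tolerance is used for the floating-point comparison against the throughput bound.

```python
import math

# Rows t3, t4, t5 of Table 1; entry k-1 is the response time on k processors.
T3 = [10, 5.5, 3.4, 3, 2.5, 2, 1.5, 2]
T4 = [20, 12, 10, 9, 8, 7, 6, 5]
T5 = [15, 10, 8, 5, 4, 3.5, 3, 2.5]


def min_processors(f, lam):
    """Smallest x with f(x) <= 1/lam, or infinity if no table entry qualifies."""
    for x, t in enumerate(f, start=1):
        if t * lam <= 1 + 1e-12:
            return x
    return math.inf


def min_processors_aggregate(tables, lam):
    """Minimal requirement of an aggregate: the sum over its constituent tasks."""
    return sum(min_processors(f, lam) for f in tables)
```

For a throughput requirement of 0.1 (every task must finish within 10 time units), the three tasks need 1, 3 and 2 processors respectively, so the aggregate needs 6. For a requirement of 0.2 the sum is 15, which exceeds the 8 processors of the example, so no feasible assignment exists.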

To answer the second question, especially when $Q$ is complex, we need to manipulate the
functions so that certain conditions are satisfied. For a response time function $f(x)$, define the
reduced response time function $\bar{f}(x)$ as:

$$\bar{f}(x) = \min_{0 \le y \le x} f(y).$$

Note that $\bar{f}$ is monotonically decreasing (non-increasing), whereas $f$ need not be, and can be
defined both for single tasks as well as for whole computations by using the appropriate response
time function. In several applications, increasing communication costs when a large number of
processors is used can force response times to increase with increasing $x$. In general, we would
like to treat response time functions that behave arbitrarily (exhibit several local minima) with
increasing $x$. The adjustment above will prevent assigning “too many” processors. A processor
assignment $x$ is called reducible if $\exists y < x : f(y) \le f(x)$. It is otherwise irreducible. For obvious
reasons, we seek irreducible assignments. In the example in Table 1 the response time for task $t_3$,
i.e., $f_3(x)$, can be reduced while all other functions cannot. After the adjustment, we have the
reduced response time function with $\bar{f}_3(8) = 1.5$, which assigns only 7 processors to task $t_3$.
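
The reduction amounts to a running minimum over the response time table. The sketch below illustrates it on row t3 of Table 1; the helper names are hypothetical.

```python
from itertools import accumulate

# Row t3 of Table 1; entry k-1 is the response time on k processors.
T3 = [10, 5.5, 3.4, 3, 2.5, 2, 1.5, 2]


def reduce_response_times(f):
    """Reduced response time function: the running minimum of f."""
    return list(accumulate(f, min))


def irreducible_assignment(f, x):
    """Smallest processor count y <= x that attains the reduced value at x."""
    best = min(f[:x])
    return f.index(best) + 1
```

Here `reduce_response_times(T3)` differs from `T3` only in its last entry, which drops from 2 to 1.5, and `irreducible_assignment(T3, 8)` returns 7, matching the example in the text.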

We next derive some properties of reduced response time functions that we will later use in our
algorithms. Throughout, $\bar{f}$ denotes the reduced form of a response time function $f$. Consider first
a simple case of two elemental tasks $t_1$ and $t_2$ and their aggregate, $s$. Suppose $f_1(x)$ and $f_2(x)$ are
the response time functions for $t_1$ and $t_2$, and $H(x_1, x_2)$ is a real-valued function increasing in both
arguments. Define

$$f_s(x) = \min_{0 \le y \le x} \{ H(f_1(y), f_2(x - y)) \}. \qquad (2)$$

Here $f_s$ is the optimal response time function of the aggregate task $s$, written as some function of
the response time functions of $t_1$ and $t_2$. In this paper, $H$ is usually a sum (for series tasks) or a
maximum (for parallel tasks). Define

$$\hat{f}_s(x) = \min_{0 \le y \le x} \{ H(\bar{f}_1(y), \bar{f}_2(x - y)) \}. \qquad (3)$$

We next show that:

Lemma 3.1 For all $x = 1, \ldots, p$, $\bar{f}_s(x) = \hat{f}_s(x)$.

Proof: We first show that $\hat{f}_s(x)$ is monotone decreasing in $x$, and therefore $\hat{f}_s(x)$ is already
irreducible. Since $\bar{f}_1$ and $\bar{f}_2$ are monotone decreasing and $H$ is increasing, for any $y$

$$H(\bar{f}_1(y), \bar{f}_2(x - y)) \ge H(\bar{f}_1(y), \bar{f}_2(x + 1 - y)).$$

Therefore,

$$\min_{0 \le y \le x} \{ H(\bar{f}_1(y), \bar{f}_2(x - y)) \} \ge \min_{0 \le y \le x+1} \{ H(\bar{f}_1(y), \bar{f}_2(x + 1 - y)) \},$$

that is, $\hat{f}_s(x)$ is decreasing.

Next, for any $x \ge y \ge 0$, $\bar{f}_1(y) \le f_1(y)$ and $\bar{f}_2(x - y) \le f_2(x - y)$. Thus

$$H(\bar{f}_1(y), \bar{f}_2(x - y)) \le H(f_1(y), f_2(x - y))$$

and hence

$$\min_{0 \le y \le x} \{ H(\bar{f}_1(y), \bar{f}_2(x - y)) \} \le \min_{0 \le y \le x} \{ H(f_1(y), f_2(x - y)) \} = f_s(x).$$

As this is true for all $x = 1, \ldots, p$, it follows that

$$\min_{0 \le y \le x} \{ \hat{f}_s(y) \} \le \min_{0 \le y \le x} \{ f_s(y) \} \quad \text{for all } x.$$

But the left-hand side of the above is simply $\hat{f}_s(x)$ (since $\hat{f}_s$ is decreasing); the right-hand side is
$\bar{f}_s(x)$ (by definition), showing that $\hat{f}_s(x) \le \bar{f}_s(x)$ for all $x = 1, \ldots, p$.

Finally, we show $\hat{f}_s(x) \ge \bar{f}_s(x)$. For the sake of contradiction suppose $\exists x_0 : \bar{f}_s(x_0) > \hat{f}_s(x_0)$.
Then

$$\min_{0 \le y \le x_0} \{ f_s(y) \} > \hat{f}_s(x_0)$$

and thus,

$$\forall y \le x_0 : \min_{0 \le w \le y} \{ H(f_1(w), f_2(y - w)) \} > \min_{0 \le z \le x_0} \{ H(\bar{f}_1(z), \bar{f}_2(x_0 - z)) \}. \qquad (4)$$

Next let the minimum of the right side of inequality (4) be achieved at $z = z_0$ with value

$$H(\bar{f}_1(z_0), \bar{f}_2(x_0 - z_0)) = H(f_1(a), f_2(b))$$

with $\bar{f}_1(z_0) = f_1(a)$ and $\bar{f}_2(x_0 - z_0) = f_2(b)$ for some $a \le z_0$, $b \le x_0 - z_0$ and $a + b \le x_0$. Note
that $a$ and $b$ are obtained through the reduction of $f_1$ and $f_2$. We may also rewrite inequality
(4) as

$$\forall y \le x_0 : \min_{0 \le w \le y} \{ H(f_1(w), f_2(y - w)) \} > H(\bar{f}_1(z_0), \bar{f}_2(x_0 - z_0)). \qquad (5)$$

But, with $y = a + b \le x_0$ above, we get

$$\min_{0 \le w \le y} \{ H(f_1(w), f_2(y - w)) \} \le H(f_1(a), f_2(b)) = H(\bar{f}_1(z_0), \bar{f}_2(x_0 - z_0))$$

which contradicts (5) and therefore, $\bar{f}_s(x) = \hat{f}_s(x)$.

■


Thus, we have shown that no information is lost in reduction, since the desired optimal response
time function of the aggregate, $\bar{f}_s$, is obtained using the reduced response time functions of the
constituent tasks. This is an important point: we will build up response-time functions for complex
tasks using increasing functions $H$, and minimization equations of the form shown in equation (2).
We have just shown that if we start with reduced response time functions, then we will construct
reduced response time functions, and the assignments associated with them will be irreducible.
The lemma can be generalized through an easy induction argument for multiple, complex tasks.
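
The aggregation of equation (2) and the claim of Lemma 3.1 can be checked numerically on small tables. The sketch below is an illustration only: it stores each table with an extra entry at index 0 equal to infinity (matching the convention $f_i(0) = \infty$), uses rows t3 and t4 of Table 1, and the helper names are hypothetical.

```python
import math
import operator
from itertools import accumulate

INF = math.inf
# Rows t3 and t4 of Table 1, with index 0 holding f(0) = infinity.
F3 = [INF, 10, 5.5, 3.4, 3, 2.5, 2, 1.5, 2]
F4 = [INF, 20, 12, 10, 9, 8, 7, 6, 5]


def reduce_rt(f):
    """Reduced response time function: the running minimum of f."""
    return list(accumulate(f, min))


def aggregate(f1, f2, H):
    """Equation (2): f_s(x) = min over 0 <= y <= x of H(f1(y), f2(x - y))."""
    p = len(f1) - 1
    return [min(H(f1[y], f2[x - y]) for y in range(x + 1)) for x in range(p + 1)]


def lemma_holds(f1, f2, H):
    """Compare reducing the aggregate with aggregating the reduced functions."""
    lhs = reduce_rt(aggregate(f1, f2, H))
    rhs = aggregate(reduce_rt(f1), reduce_rt(f2), H)
    return all(math.isclose(a, b) for a, b in zip(lhs, rhs))


series = aggregate(F3, F4, operator.add)  # H is a sum for series tasks
parallel = aggregate(F3, F4, max)         # H is a maximum for parallel tasks
```

With 8 processors, the series aggregate of t3 and t4 is 11.4 (3 processors to t3, 5 to t4) and the parallel aggregate is 7 (2 processors to t3, 6 to t4). `lemma_holds` returns `True` for both choices of `H`, as the lemma predicts.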

Lemma 3.2 Let $s_1, \ldots, s_k$ be $k$ complex tasks with optimal response time functions $g_1, \ldots, g_k$ and
$H(x_1, \ldots, x_k)$ be an increasing function in each argument. If $s$ is the task that represents the
aggregate of tasks $s_1, \ldots, s_k$ with reduced optimal response time function $\bar{h}(x)$ and defining

$$\hat{h}(x) = \min_{\substack{y_1, \ldots, y_k \in [1, x] \\ y_1 + \cdots + y_k = x}} \{ H(\bar{g}_1(y_1), \ldots, \bar{g}_k(y_k)) \}$$

then $\bar{h}(x) = \hat{h}(x)$.

Remark 3.1 If the irreducible minimums of the functions $g_1, \ldots, g_k$ occur at $x_1, \ldots, x_k$, then the
irreducible minimum of $h$, $x_0$, satisfies $x_0 \le \sum_{i=1}^{k} x_i$.

The last remark implies that when constructing $h$ we may restrict our attention to only those
assignment vectors $(y_1, \ldots, y_k)$ for which $\sum_{i=1}^{k} y_i \le \sum_{i=1}^{k} x_i$. This will result in improved execution
time for our optimization algorithms when $\sum_{i=1}^{k} x_i < O(p)$. Next, we begin our presentation of the
algorithms by first treating the two simpler task structures, linear series tasks and linear parallel
tasks.




4  Linear  Task  Structures 


Linear task structures are interesting both because many pipelines are simple linear chains [25] and
because chains appear as tasks in more complex task structures. We examine two different ways of
assessing the cost of a linear chain. The first is when the chain is a linear pipeline, and the response
time function is the sum of the response times of each of the ‘stages’ [25]. This is called a series
task structure. The second is when the constituent tasks execute in parallel on different aspects
of the same data set, a parallel task structure. For both problems we show how to construct the
optimal response time function for the aggregate task, and, for every $q = 1, \ldots, p$, how to recover
the optimal assignment of $q$ processors from information computed as the response time function
was constructed.

In the treatments of both problems we consider $s_1, \ldots, s_m$ to be the set of $m$ constituent tasks,
and $g_1, \ldots, g_m$ to be their respective response-time functions. Let $s$ be the aggregate task whose
optimal response time function $h(x)$, $0 \le x \le p$, we are interested in computing. Note that each
constituent task $s_j$ may already be an aggregation of the elemental tasks $t_i$. Our immediate goal is
to construct the overall reduced response time function for processors in the range $[1, p]$ and also
to recover the optimal assignment when required.

4.1  Series  Tasks 

First we describe an algorithm that constructs the optimal response time function $h(x)$ for linear
task structures when each function $g_i(x)$ is convex (see [30], pp. 453-454) in $x$, i.e., when the
efficiency of parallelism is decreasing (see p. 217 in [16] for an example). We later treat the
general case.

Let the assignment be recorded in $I(s, x) = (x_1, \ldots, x_m)$, where $x_j$ denotes the number of
processors assigned to task $s_j$. Also let $h_G$ denote the response time function created by our algorithm.
As a first step, we must ensure that every task $s_i$ is allocated enough processors $u_\lambda(s_i)$ to meet
the throughput constraint. For each $i = 1, \ldots, m$, let $x_i = u_\lambda(s_i)$ be this initial assignment. Of
course, the algorithm terminates at this point if $\sum_{i=1}^{m} x_i > p$, because no feasible assignment exists.
Note that this first step does not require the presumed convexity of each $g_i$. Let $t = \sum_{i=1}^{m} x_i$;
we set $h_G(x) = \infty$ for all $x < t$ to reflect an inability to meet the throughput requirement, set
$h_G(t) = \sum_{i=1}^{m} g_i(x_i)$, and let $x = t + 1$. Next, for each $s_i$, compute $d(i, x_i) = g_i(x_i + 1) - g_i(x_i)$, the
change in response time achieved by allocating one more processor to $s_i$. Build a max priority heap
[20] where the priority of $s_i$ is $|d(i, x_i)|$. Finally, enter a loop where, on each iteration,

•  The task (say $s_j$) with highest priority is allocated another processor.

•  Let $a$ denote the number of processors previously assigned to $s_j$. Compute $h_G(x) = h_G(x - 1) + d(j, a)$,
and set $I(s, x) = (x_1, \ldots, x_j + 1, \ldots, x_m)$.

•  Increment $x$.

•  Compute $s_j$’s new priority, and adjust the priority heap accordingly.

We iterate until all available processors have been assigned, or the top element of the heap is
non-negative, i.e., $d(j, x_j)$ is non-negative. If the top element becomes non-negative when $x = y$, then
we assign $h_G(z) = h_G(y - 1)$ and $I(s, z) = I(s, y - 1)$ for all $z = y, \ldots, p$.

Each iteration of the loop allocates the next processor to the task which stands to benefit most from the allocation. When the individual task response functions are convex, then the greedy response time function $h_G$ it produces is optimal, and is irreducible.
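
The greedy procedure above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the function name `greedy_series`, the table layout `g[i][k]` (response time of task `i` on `k` processors, index 0 unused) and the list `u` of throughput minimums are assumptions made for the example. Because `heapq` is a min-heap, the sketch stores the signed change $d$ itself, so the most negative change (largest decrease) is popped first.

```python
import heapq
import math

def greedy_series(g, u, p):
    """Greedy allocation for a chain of tasks with convex response times.

    g[i][k]: response time of task i on k processors (tables of length p + 1).
    u[i]:    fewest processors task i needs to meet the throughput constraint.
    Returns (h, I): h[x] is the total response time with x processors and
    I[x] the assignment achieving it (None where infeasible).
    """
    m = len(g)
    t = sum(u)
    h = [math.inf] * (p + 1)
    I = [None] * (p + 1)
    if t > p:                      # no feasible assignment exists
        return h, I
    alloc = list(u)
    h[t] = sum(g[i][alloc[i]] for i in range(m))
    I[t] = tuple(alloc)
    # Heap entries are (d, i) with d = g_i(x_i + 1) - g_i(x_i).
    heap = [(g[i][alloc[i] + 1] - g[i][alloc[i]], i)
            for i in range(m) if alloc[i] + 1 < len(g[i])]
    heapq.heapify(heap)
    x = t + 1
    while x <= p and heap and heap[0][0] < 0:
        d, j = heapq.heappop(heap)
        alloc[j] += 1
        h[x] = h[x - 1] + d
        I[x] = tuple(alloc)
        if alloc[j] + 1 < len(g[j]):
            heapq.heappush(heap, (g[j][alloc[j] + 1] - g[j][alloc[j]], j))
        x += 1
    # No further improvement is possible: copy the last value forward.
    for z in range(x, p + 1):
        h[z] = h[x - 1]
        I[z] = I[x - 1]
    return h, I
```

With two hypothetical convex tables, for instance `[inf, 10, 6, 4, 3, 3, 3, 3, 3]` and `[inf, 8, 5, 4, 4, 4, 4, 4, 4]`, the loop stops as soon as the best remaining change is zero and the remaining entries are copied forward.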

Prop. 4.1 Suppose that $g_i(x)$ is convex over $x \in [1,p]$, for all $i = 1, \ldots, m$. Then for all $x \in [1,p]$, $h_G(x) = \mathcal{H}(x)$, the optimal response time function. Furthermore, $h_G(x)$ is irreducible.

Proof: Clearly, each task $s_i$ must receive at least $u_\lambda(s_i)$ processors in order for the throughput condition to be satisfied. Recalling that $t = \sum_{i=1}^{m} u_\lambda(s_i)$, it is clear that $h_G(x) = \mathcal{H}(x) = \infty$ for all $x \in [1, t-1]$. Now consider $x = t$. For all $j = 1, \ldots, p - t$ the remainder of the algorithm should assign "the next" $j$ processors in such a way as to obtain the maximal possible decrease in response time given $j$ additional processors. The proposed algorithm does exactly that. $D = \{ d(i, x_i + j) \mid 1 \le i \le m,\ 0 \le j < p - t \}$ is the set of all possible changes for the remainder of the assignment. For every $j = 1, \ldots, p - t$, the maximal decrease is obtained by choosing the $j$ largest (in magnitude) elements of $D$. Since each $g_i$ is convex, $|d(i, x_i + j_1)| \le |d(i, x_i + j_2)|$ for $j_1 > j_2$ (see [30], pp. 453-454) and so the $j$ elements with largest magnitude in $D$ are selected as given in the algorithm.

The irreducibility of $h_G$ follows from its construction.

■ 


The complexity of this algorithm is low. The throughput condition is checked in $m$ steps. The initial priority heap is constructed in $O(m \log m)$ time; the highest priority heap element is found in $O(1)$ time and each heap adjustment requires only $O(\log m)$ time using standard heap algorithms. Thus the overall complexity is $O(m \log m) + O(p \log m) = O(p \log m)$. This is an example of how the structure of the response time function (convexity) can be used to obtain higher algorithmic efficiency than might otherwise be achievable, as we will see below for general response time functions.

A different approach, based on dynamic programming, is needed when the task response time functions are not convex. In fact, we anticipate that this condition will be the norm when considering chains whose tasks are themselves aggregates of other tasks. Since convexity need not be preserved in aggregation, we must turn to a slightly more complicated algorithm. The new approach has a higher complexity, $O(mp^2)$, but it permits completely general response time functions. We will show that certain algorithmic efficiencies are possible when bounds on the least minimums are known ahead of time.

For any $j = 1, \ldots, m$, we can view the subchain $s_1, \ldots, s_j$ as a (larger) task itself. We will call this task $S_j$, and compute its optimal response time function: for $x = 1, \ldots, p$ let $G_\lambda(j,x)$ be the minimal response time of $S_j$, subject to throughput constraint $\lambda$, achievable when no more than $x$ processors are allocated to it. The function $G_\lambda(j,\cdot)$ is thus $S_j$'s optimal response time function; in computing this function we will simultaneously check the throughput constraint, hence the subscript $\lambda$. Using the principle of optimality [12], we may write a recursive definition for $G_\lambda(j,x)$ as follows.


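$$
G_\lambda(j,x) =
\begin{cases}
\infty & \text{if } u_\lambda(s_j) + u_\lambda(S_{j-1}) > x \\
g_1(x) & \text{if } j = 1 \text{ and } u_\lambda(s_1) \le x \\
\min\limits_{u_\lambda(s_j) \le i \le x - u_\lambda(S_{j-1})} \{\, g_j(i) + G_\lambda(j-1,\, x-i) \,\} & \text{otherwise.}
\end{cases}
\tag{6}
$$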


These equations define response time to be $\infty$ whenever insufficiently many processors are allocated to $s_j$ or $S_{j-1}$ to meet the throughput constraint; we define $u_\lambda(S_0) = 0$ as a boundary condition. Observe that $\mathcal{H}(x) = G_\lambda(m,x)$. Note that the $\Pi$ function (Lemma 3.2) is the 'sum' operator here, in the third part of the equation.

The dynamic programming equation is more intuitively explained by reading it 'top down'. Suppose we had somehow computed the response time table for the first $j - 1$ tasks (the 'large' task $S_{j-1}$), i.e., $G_\lambda(j-1,\cdot)$. Then, given $x$ processors to distribute between tasks $s_j$ and $S_{j-1}$, we try every combination subject to the throughput constraints: $i$ processors for $s_j$ and $x - i$ processors for $S_{j-1}$. Since the equation is written as a recursion, the computation will actually build response time tables for larger tasks 'bottom up', starting with task $s_1$ in the second part of the equation. Note that similar explanations may be given for the dynamic programming equations that appear later in the paper. The optimal assignment of $q$ ($1 \le q \le p$) processors to tasks is found by setting the appropriate value of $I$ as we solve for the value $G_\lambda(j,x)$. Suppose that $i$ solves $G_\lambda(j,x) = g_j(i) + G_\lambda(j-1, x-i)$. Then we set $I(S_j,x) = (x_1, \ldots, x_{j-1}, i)$, where $I(S_{j-1}, x-i) = (x_1, \ldots, x_{j-1})$.

An important consequence of Lemma 3.2 is that each function $G_\lambda(j,\cdot)$ (and hence each assignment $I(S_j,x)$) is irreducible. This follows directly from the fact that equation (6) has the form specified by equation (3). The more complex bounds on the minimum's index variable in equation (6) serve simply to keep the index $i$ away from regions where either $g_j(\cdot)$ or $G_\lambda(j-1,\cdot)$ are known to take value $\infty$.

If we have already solved for the minimal response time function $G_\lambda(j-1,\cdot)$, we may use equation (6) to determine $G_\lambda(j,\cdot)$. The cost of determining one individual $G_\lambda(j,x)$ value is seen to be $O(x) = O(p)$; the cost of determining the whole function $G_\lambda(j,\cdot)$ is thus $O(p^2)$, and the cost of determining all such functions (and hence the desired response time function $G_\lambda(m,\cdot)$) is $O(mp^2)$.
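
A minimal Python sketch of this dynamic program is given below, under the assumption that each table `g[j]` is already irreducible (non-increasing) and has length `p + 1`. The names `series_dp`, `recover`, `need` and `choice` are illustrative and do not come from the paper; `need[j]` plays the role of $u_\lambda(S_j)$ and `choice[j][x]` records the minimizing $i$ so that the assignment can be recovered by backtracking.

```python
import math

def series_dp(g, u, p):
    """Tabulate G[j][x] for a chain of m tasks (tasks are 1-indexed in G).

    g[j-1][k]: response time of task s_j on k processors.
    u[j-1]:    processors s_j needs to meet the throughput constraint.
    """
    m = len(g)
    need = [0] * (m + 1)           # need[j] = sum of u over the first j tasks
    for j in range(1, m + 1):
        need[j] = need[j - 1] + u[j - 1]
    G = [[math.inf] * (p + 1) for _ in range(m + 1)]
    choice = [[0] * (p + 1) for _ in range(m + 1)]
    for x in range(u[0], p + 1):   # base case: the chain is just s_1
        G[1][x] = g[0][x]
        choice[1][x] = x
    for j in range(2, m + 1):
        for x in range(need[j], p + 1):
            # i processors for s_j, x - i for the aggregate S_{j-1}
            for i in range(u[j - 1], x - need[j - 1] + 1):
                cost = g[j - 1][i] + G[j - 1][x - i]
                if cost < G[j][x]:
                    G[j][x] = cost
                    choice[j][x] = i
    return G, choice

def recover(choice, m, x):
    """Backtrack through choice to list the processors given to s_1..s_m."""
    alloc = []
    for j in range(m, 0, -1):
        i = choice[j][x]
        alloc.append(i)
        x -= i
    return tuple(reversed(alloc))
```

The three nested loops make the $O(mp^2)$ cost visible: `m` tables, `p` entries per table, and up to `p` candidate splits per entry.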

The application of the above dynamic programming procedure, in equation (6), is illustrated in Figure 1 (which shows the computation of $G_\lambda(j,\cdot)$) for a task structure with three tasks. The response time functions, $g_i(x)$, for the three tasks $t_1$, $t_2$ and $t_3$ are taken from Table 1 and the throughput constraint $\lambda = 1/40$. Since we use tasks from Table 1, we revert to using $t_i$ for the constituent tasks. The first column of the table identifies the aggregated task $S_j$ for $1 \le j \le 3$; here $S_1 = t_1$, $S_2 = (t_1,t_2)$ and $S_3 = (t_1,t_2,t_3)$. A row $j$ corresponds to the response time function $G_\lambda(j,x)$, for aggregated task $S_j$; entry $[k,l]$ in the table (row $k$, column $l$) gives the value, and the corresponding assignment, for $G_\lambda(k,l)$. The last row shows the assignment produced by the algorithm; this assigns 3 processors to tasks $t_1$ and $t_2$ and 2 processors to $t_3$, with minimum response time of 30.5 and an achieved throughput of 1/14. Note that in our example above, and in all other examples to follow, we have omitted the dummy task that is the last task executed on the data set, since it plays no role in the computation.

Figure 1: Application of Algorithm for series tasks; $G_\lambda(j,x)$ for $1 \le j \le 3$, $1 \le x \le 8$

The dynamic programming equations can sometimes be solved more efficiently, when each $g_i$ has an irreducible minimum at $z_i$, and each $z_i$ is small relative to $p$. Suppose $z_i \le L$ for all $i = 1, \ldots, m$. We next show how the optimality equations can be solved in $O(m^2L^2)$ time. This is advantageous when $L < O(p/\sqrt{m})$.
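
The threshold on $L$ can be read off by comparing the restricted bound with the $O(mp^2)$ cost of the general dynamic program (constant factors ignored):

$$
m^2 L^2 \le m p^2 \iff L^2 \le \frac{p^2}{m} \iff L \le \frac{p}{\sqrt{m}}.
$$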

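As we solve for each $G_\lambda(j,x)$, Remark 3.1 also tells us that we need not consider assigning any more than $z_j \le L$ processors to $s_j$. This means we can rewrite the optimality equations as

$$
G_\lambda(j,x) =
\begin{cases}
\infty & \text{if } u_\lambda(s_j) + u_\lambda(S_{j-1}) > x \\
g_1(x) & \text{if } j = 1 \text{ and } u_\lambda(s_1) \le x \\
\min\limits_{u_\lambda(s_j) \le i \le \min\{z_j,\; x - u_\lambda(S_{j-1})\}} \{\, g_j(i) + G_\lambda(j-1,\, x-i) \,\} & \text{otherwise.}
\end{cases}
\tag{7}
$$

The more complex bound on $i$ prohibits indexing values of $i$ such that $S_{j-1}$ cannot meet the throughput constraint, and values indexing beyond $s_j$'s known minimum. Thus, the cost of computing $G_\lambda(j,x)$ is only $O(L)$. Since we need only compute $G_\lambda(j,x)$ for $x$ up to $z_1 + \cdots + z_j \le jL$, the cost of computing $G_\lambda(j,\cdot)$ is $O(jL^2)$, so that the cost of solving the overall problem is $\sum_{j=1}^{m} O(jL^2) = O(m^2L^2)$.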


4.2  Parallel  Tasks 

In this subproblem, we have a sequence $S$ of tasks $s_1, \ldots, s_m$ with irreducible response-time functions $g_1, \ldots, g_m$ for which we need to determine the irreducible optimal response-time function $\mathcal{H}(x)$ for the maximum where

$$
\mathcal{H}(x) = \min_{\substack{x_1, \ldots, x_m \\ x_1 + \cdots + x_m = x}} \max\{\, g_1(x_1),\, g_2(x_2),\, \ldots,\, g_m(x_m) \,\}.
$$

In this case, the function $\Pi$ (in Lemma 3.2) is the maximum operator. The basic idea behind the algorithm is that after processors are allocated to meet the throughput requirement, we can only drive the maximum response time down by allocating a processor to the task whose response time under the present allocation is maximal. This process is repeated until the maximum number of needed processors is allocated. This idea is now made more precise.

Suppose that the irreducible minimum of each $g_i$ occurs at $z_i$, and let $z_S = \sum_{i=1}^{m} z_i$. First, observe that the response time function value at all processor counts smaller than $t = \sum_{i=1}^{m} u_\lambda(s_i)$ is $\infty$. Thus, for $i = 1, \ldots, m$, we begin by assigning $u_\lambda(s_i)$ processors to task $s_i$. This is also reflected in the initialization of the data structure recording assignments, as $I(S,t) = (u_\lambda(s_1), \ldots, u_\lambda(s_m))$. Set $h(x) = \infty$ for $x = 1, \ldots, t-1$, and $h(t) = \max_{1 \le i \le m} \{ g_i(u_\lambda(s_i)) \}$. Next build a max-priority heap on the tasks, where $g_i(u_\lambda(s_i))$ is the priority for task $s_i$. Let $x = t + 1$, and enter a loop where the following is performed for at most $z_S - t$ iterations.

• Give an additional processor to the task whose priority is greatest. Let $y_x$ be that maximal priority.

• If that task (say $s_i$) was previously assigned $x_i$ processors, and if $x_i = z_i$, then terminate the algorithm.

• If that task (say $s_i$) was previously assigned $x_i < z_i$ processors, reset its new priority to $g_i(x_i + 1)$. Set $I(S,x) = (x_1, \ldots, x_i + 1, \ldots, x_m)$, where $I(S,x-1) = (x_1, \ldots, x_i, \ldots, x_m)$.

• Adjust the max-priority heap to reflect the task's new priority, and set $h(x)$ to the maximum value in the heap.

• Increment $x$.

If the loop terminates with $x = y$, then set $h(z) = h(y-1)$ and $I(S,z) = I(S,y-1)$ for all $z = y, \ldots, p$.
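
The loop just described can be sketched in Python as follows. This is an illustration under stated assumptions rather than the paper's code: `greedy_parallel`, the tables `g[i][k]` (length `p + 1`), the throughput minimums `u` and the minima positions `z` are names chosen for the example. Since `heapq` is a min-heap, priorities are stored negated so that the task with the largest current response time sits at the top.

```python
import heapq
import math

def greedy_parallel(g, u, z, p):
    """Greedy min-max allocation for m parallel tasks.

    g[i][k]: response time of task i on k processors.
    u[i]:    processors task i needs to meet the throughput constraint.
    z[i]:    processor count at which g[i] reaches its irreducible minimum.
    Returns (h, I): h[x] is the maximum response time with x processors.
    """
    m = len(g)
    t = sum(u)
    h = [math.inf] * (p + 1)
    I = [None] * (p + 1)
    if t > p:                          # throughput cannot be met
        return h, I
    alloc = list(u)
    heap = [(-g[i][alloc[i]], i) for i in range(m)]
    heapq.heapify(heap)
    h[t] = -heap[0][0]
    I[t] = tuple(alloc)
    x = t + 1
    while x <= p:
        _, i = heap[0]                 # task with the largest response time
        if alloc[i] >= z[i]:           # it cannot improve: stop
            break
        alloc[i] += 1
        heapq.heapreplace(heap, (-g[i][alloc[i]], i))
        h[x] = -heap[0][0]             # new maximum over all tasks
        I[x] = tuple(alloc)
        x += 1
    for y in range(x, p + 1):          # copy the final value forward
        h[y] = h[x - 1]
        I[y] = I[x - 1]
    return h, I
```

Each iteration performs one heap replacement, which is the source of the logarithmic factor in the running time.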

The termination condition follows from the observation that if $s_i$ has the maximum response time but already has $z_i$ processors assigned, no further assignment of processors to $s_i$ can reduce its response time. Since the objective function is the maximum response time among tasks, that objective function cannot be further reduced. It is clear then that the procedure we describe constructs an irreducible function. The algorithm's correctness is established with the following lemma.

Lemma 4.1 For every $x = t, \ldots, p$, $h(x) = \mathcal{H}(x) = y_x$.

Proof: For every $i = 1, \ldots, m$, let $S_i = \{ g_i(x) \mid u_\lambda(s_i) \le x \le z_i \}$ be the set of feasible response times for $s_i$ following its initial assignment, and let $S = \bigcup_{i=1}^{m} S_i$. Since the objective function value for an assignment is the maximum response time under that assignment and since we stop assigning processors once the objective function can no longer be minimized, $S$ contains every value of $y_x$ generated by our algorithm. Furthermore, the sequence $y_t, y_{t+1}, \ldots$ describes the elements of $S$ in descending order. Now if an assignment is to achieve cost $y_x$, the response time of every task must be no greater than $y_x$. We argue that our algorithm finds an assignment achieving cost $y_x$, using the minimum number of processors. For every $y_i$, let $T(y_i)$ be the task from whose response-time function $y_i$ is taken. Our algorithm allocates an additional processor to $T(y_t)$, then another to $T(y_{t+1})$, and so on. For every $x = t, \ldots, z_S$ and $j = 1, \ldots, m$ let $P_j(x)$ be the number of elements $y_a$ with $a < x$ for which $T(y_a) = s_j$. $P_j(x)$ is thus the number of additional processors our algorithm has allocated to $s_j$ by the $(x - t)$th pass through the loop, and is also the minimum number of additional processors (after $u_\lambda(s_j)$) that $s_j$ must be assigned if its response is to be no greater than $y_x$. As this is true for every task for every $y_x$, it follows that the assignment generated by our algorithm achieves each cost $y_x$ with the minimum number of processors. The lemma's conclusion is a restatement of this fact. ■


Since the algorithm's loop is executed at most $z_S - t$ times, the overall cost of the algorithm is $O(m \log m + z_S \log m)$. The optimal assignment is found in $I(S,p)$. An example of the application of this algorithm is shown in the next section; in Figure 2 the row for $B_1$ shows the response time function (and the corresponding assignment) of a parallel task composed of tasks $t_1$ and $t_2$.

While the problems studied in this paper are distinctly different from those addressed in the literature, a closer look reveals that the above algorithm (for parallel tasks) is a generalization of the algorithm independently conceived in [27]. While they address the problem of finding a nonpreemptive schedule for a set of $n$ independent tasks, i.e., parallel tasks, their algorithm in fact finds an assignment which satisfies the feasibility conditions of our problem. Our algorithm is a generalization in the sense that they do not "construct" a reduced response time table for the entire parallel task that provides the response time as a function of the number of processors. This is essential for our solution technique, which views complex task structures as compositions of simpler task structures.

5  Complex  Tasks 

The algorithms we have developed to analyze series and parallel task structures can be used to analyze task structures whose graphs form trees, or series-parallel graphs. We now show how the response time function for a tree task with $n$ nodes and arbitrary branching is computed in $O(np^2)$ time, and how a series-parallel task with arbitrary branching is analyzed in $O(np^2)$ time. Note that the complex tasks we consider usually determine a whole pipeline computation and thus, we will henceforth use $n$ (as in Section 2) to denote the number of nodes in the task graph. Series-parallel graphs arise frequently in applications where data in a set is split, processed separately, and then rejoined. The basic idea behind our algorithms is that these complex structures can be viewed as a composition of series and parallel tasks, thus facilitating the use of the algorithms designed thus far.


5.1 Tree Tasks

Suppose the precedence graph for $\mathcal{P}$ forms a tree with $n$ nodes. Either out-trees (edges directed to child nodes) or in-trees (edges directed to parent node) are permissible. Without loss of generality (because path lengths are unaffected by arc direction) our discussion will concern out-trees.

For notational convenience we assume that every non-leaf node has exactly $b$ children; our approach extends immediately to the general case. For every task $s_j$, let $c_{j,1}, \ldots, c_{j,b}$ be $s_j$'s children. $s_j$ is the root of a subtree which can be viewed as a subtask $T_j$ with its own response time function. Dynamic programming again expresses the optimal response time function for each $T_j$. The optimal response time function for $T_1$ is the overall problem solution.

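Let $G_\lambda(j,x)$ be the optimal response time achievable by $T_j$ when subject to throughput constraint $\lambda$. Let $\mathcal{I}$ be the set of interior tree tasks, and $\mathcal{L}$ be the set of leaf tasks. The principle of optimality states that

$$
G_\lambda(j,x) =
\begin{cases}
\infty & \text{if } s_j \in \mathcal{L} \text{ and } u_\lambda(s_j) > x \\
\min\limits_{\substack{x_0, \ldots, x_b \\ x_0 + \cdots + x_b = x}} \Big\{\, g_j(x_0) + \max\limits_{1 \le i \le b} \{ G_\lambda(c_{j,i},\, x_i) \} \Big\} & \text{otherwise.}
\end{cases}
$$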

The formidable recursive expression simply takes the minimum cost over all possible partitionings of $x$ processors among $s_j$ and the $b$ subtrees rooted in its children. Fortunately, the results developed in Section §4 may be employed to solve this equation efficiently. The subtasks $c_{j,1}$ through $c_{j,b}$ form a single parallel task, $B$. The algorithm developed in the previous section constructs $B$'s irreducible response time function in $O(p \log b)$ time. Next we can view $T_j$ as a series task, composed of $s_j$ and $B$. Given $B$'s response time function, $T_j$'s irreducible response time function is computed in $O(p^2)$ additional time using the algorithm described in Section §4.1. Thus, the cost of computing the serial composition dominates. The complexity of computing response time functions for all $T_j$ where $s_j \in \mathcal{I}$ is $O(|\mathcal{I}|\,p^2)$. Note however that $b\,|\mathcal{I}| = n - 1$, which implies that the total cost of processing interior tasks is $O(np^2/b)$. Since the cost of processing all leaf tasks is $O(n)$, the total cost in the general case is $O(np^2/b)$.

The procedure is illustrated by the example in Figure 2, a tree with 5 constituent tasks; here $\lambda = 1/40$. The tasks $t_1, t_2$ form a parallel task, denoted $B_1$; $B_1$ and $t_3$ form a series task, denoted $T_3$. Similarly, the aggregate task $T_3$ and $t_4$ form a parallel task $B_2$; $B_2$ and $t_5$ form a series task $T_5$ whose response time gives us the response time of the entire task. Note that the tasks $t_1, \ldots, t_5$ are taken from Table 1. Each row of the table shows the response time assignment for the corresponding aggregated task. The minimum response time achieved by the assignment is 41 (by assigning 2 processors to $t_1$, 3 to $t_2$ and one processor to each of the other three tasks) and the achieved throughput is 1/20.

Task aggregates       x = 5          x = 6          x = 7          x = 8
B_1 (t_1, t_2)        16             14             11             11
                      (2,3)          (3,3)          (3,4)          (4,4)
T_3                   31             26             21.5           19.5
                      (2,2,1)        (2,3,1)        (2,3,2)        (2,3,3)
B_2 (t_4, T_3)
T_5 (t_5, B_2)        65             54             46             41
                      (1,1,1,1,1)    (1,2,1,1,1)    (2,2,1,1,1)    (2,3,1,1,1)

Figure 2: Application of Algorithm for Tree Structures

Better complexities are achievable when the irreducible minima $z_i$ for each $s_i$ satisfy $z_i \le L$ where $L \ll p$. The computation of $B$'s response time function is fast: $O(bL \log b)$ time. For $s_j + B$, let $z_{T_j}$ be the sum of the $z_i$ values for all nodes in the subtree rooted in $s_j$. Since we need not consider any assignment that gives more than $z_j$ processors to $s_j$, the response time function for $s_j + B$ is computed in $O(z_{T_j} L)$ time. This cost dominates that of computing $B$'s response time function, provided that $b \log b \le L$, which we will assume here for simplicity.

The total cost of analyzing the tree is maximized when each $z_{T_j}$ is as large as possible. This occurs when the tree is actually just a linear chain, in which case $z_{T_n} = L$, $z_{T_{n-1}} = 2L$, $z_{T_{n-2}} = 3L$, and so on. As we have seen, the total cost is then $O(n^2L^2)$. The best topology is a full tree; for example, consider a full binary tree. A subtree $T_j$ consisting of exactly 3 tasks has $z_{T_j} \le 3L$, and an analysis cost of $O(3L^2)$. $n/2$ such subtrees are analyzed. Then, $n/4$ subtrees are analyzed where $z_{T_j} \le L + 3L + 3L = 7L$. Each of these requires $O(7L^2)$ time to analyze. Continuing in this fashion we determine a complexity bound of

$$
\sum_{i=1}^{\log n} \frac{n}{2^i}\,(2^{i+1} - 1)\,L^2 = O(L^2 n \log n).
$$
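
As a brief check of this bound, assuming (as in the count above) that level $i$ contributes $n/2^i$ subtrees, each costing at most $(2^{i+1} - 1)L^2$, every level costs less than $2nL^2$:

$$
\sum_{i=1}^{\log n} \frac{n}{2^i}\,(2^{i+1} - 1)\,L^2
\;\le\; \sum_{i=1}^{\log n} \frac{n}{2^i}\, 2^{i+1} L^2
\;=\; \sum_{i=1}^{\log n} 2 n L^2
\;=\; 2 n L^2 \log n
\;=\; O(L^2 n \log n).
$$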

5.2  Series-Parallel  Tasks 

Finally, we consider series-parallel task graphs. We show that the response time function for such a graph (with $n$ nodes) can be computed in $O(np^2)$ time. A number of different but equivalent definitions of series-parallel graphs exist. The one we will use is taken from [42], which studies vertex series-parallel DAGs. However, based on their results on the equivalence of edge series-parallel DAGs and vertex series-parallel DAGs, we use the term series-parallel to mean both cases and use their definition of vertex series-parallel DAGs. A series-parallel DAG (SP) is defined recursively as follows.

(i) The DAG having a single vertex and no edges is SP.

(ii) If $G_1 = (V_1,E_1)$ and $G_2 = (V_2,E_2)$ are two SP DAGs, so are the DAGs constructed by each of the following two operations:

(a) Parallel composition: $G_p = (V_1 \cup V_2,\ E_1 \cup E_2)$.

(b) Series composition: $G_s = (V_1 \cup V_2,\ E_1 \cup E_2 \cup (T_1 \times S_2))$, where $T_1$ is the set of sinks of $G_1$ and $S_2$ is the set of sources of $G_2$.

A node $t_i$ in $G = (V,E)$ is a sink if there are no outgoing edges from $t_i$, i.e., there is no edge $(t_i,t_j)$ in $E$. A node $t_i$ is a source if there are no incoming edges to the node, i.e., there is no edge $(t_j,t_i)$ in $E$. It is shown in [42] that any SP DAG can be parsed as a binary decomposition tree (BDT). Figure 3 illustrates a series-parallel graph, and the BDT that represents the graph. The internal nodes are labeled $S_i$ or $P_i$ to denote the series or parallel composition. There is a one-to-one correspondence between BDT leaves and DAG nodes. Each internal BDT node $a$ represents either a series (labeled $S$) or parallel (labeled $P$) composition of two SP subgraphs represented by the subtrees rooted in $a$. For example, suppose $a$'s subtrees are simply leaf nodes. The corresponding nodes in the DAG are SP graphs, composed by the operation specified in $a$'s label; $a$ can be thought to be representing that composition. Now if $a$'s BDT parent is some node $q$ and $q$ has another child $a'$, then we know that $a'$ represents an SP subgraph of the original DAG, and $q$ represents the series or parallel composition of the subgraphs represented by $a$ and by $a'$. A BDT thus shows the selection and ordering of compositions necessary to establish that the original DAG is SP with respect to the definition above.

There is an obvious correspondence between SP compositions and the methods we have developed to compute response time functions for series and parallel task structures. If we think of an SP DAG's nodes as representing tasks, a series composition corresponds to the aggregation of two tasks into a series task structure; two tasks are replaced by one, and the serial edge between them disappears. Similarly, a parallel composition corresponds to the aggregation of a set of tasks into a parallel task structure. It is thus quite straightforward to construct the response time function for a series-parallel graph, once the associated BDT is known. Starting with the individual tasks' response time functions, we compose response-time functions in the order specified by the BDT. The response time functions created during intermediate steps represent aggregate subtasks in much the same way as task $T_j$ represented an entire subtree in Section §5.1. Likewise, the optimal assignment is recovered by backtracking through intermediate optimal assignments in the same way as was described for trees.

Figure 3: A Series-Parallel Graph and corresponding BDT. (a) A series-parallel graph. (b) Binary decomposition tree.

Task aggregates              Number of processors
                         5              6              7              8
P_1
parallel: (t_1, t_2)                    14             11             11
                                        (3,3)          (3,4)          (4,4)
S_1                      31                                           19.4
                         (2,2,1)                                      (2,3,3)
P_2
parallel: (t_4, t_5)     10             10             9
                         (3,2)          (3,3)          (4,3)
G = S_2
serial: (S_1, P_2)       70             59             51             46
                         (1,1,1,1,1)    (1,2,1,1,1)    (2,2,1,1,1)    (2,3,1,1,1)

Table 2: Computation of Response times for series-parallel structures
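
The bottom-up composition over a BDT can be sketched in Python as follows. This is a simplified illustration with several assumptions: the BDT is encoded as nested tuples `(op, left, right)` with `op` equal to `'S'` or `'P'` and leaves given as task names, throughput infeasibility is assumed to be already encoded as `inf` entries in the leaf tables, and both compositions are done by trying every split of the processors (an $O(p^2)$ step) rather than by the heap-based parallel algorithm of Section 4.2. The names `compose` and `evaluate` are hypothetical.

```python
import math

def compose(f1, f2, combine):
    """Combine two response time tables over every split of x processors.

    f1[k], f2[k]: response times on k processors (index 0 unused).
    combine:      sum for a series composition, max for a parallel one.
    """
    p = len(f1) - 1
    out = [math.inf] * (p + 1)
    for x in range(2, p + 1):
        # i processors to the first operand, x - i to the second
        out[x] = min(combine(f1[i], f2[x - i]) for i in range(1, x))
    return out

def evaluate(node, leaf_funcs):
    """Return the response time table of the aggregate task rooted at node."""
    if isinstance(node, str):          # a leaf of the BDT is a single task
        return leaf_funcs[node]
    op, left, right = node
    combine = (lambda a, b: a + b) if op == 'S' else max
    return compose(evaluate(left, leaf_funcs),
                   evaluate(right, leaf_funcs),
                   combine)
```

For example, `evaluate(('S', ('P', 'a', 'b'), 'c'), tables)` first builds the table for the parallel pair and then composes it in series with task `c`, mirroring the order of compositions given by the tree.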

An application of our procedure, for the series-parallel graph in Figure 3, is shown in Table 2 for throughput constraint $\lambda = 1/40$. Each row shows the response time function, and corresponding assignment, for the aggregate task formed by a series or parallel composition. For example, the row labeled $S_1$ corresponds to the aggregate task formed by the series composition of $P_1$ (which is a parallel composition of $t_1$ and $t_2$) and $t_3$. The minimum response time in the above assignment is 46 (assigning 2 processors to $t_1$, 3 to $t_2$ and one processor each to $t_3$, $t_4$ and $t_5$) and the achieved throughput is 1/20.

Once the BDT is known, the cost of determining the optimal assignment is $O(np^2)$, as every response-time function composition has cost $O(p^2)$; there are at most $n$ such compositions performed. As we have seen before, the cost is reduced to $O(L^2 n \log n)$ when the irreducible minima $z_i$ for each $s_i$ satisfy $z_i \le L$. It is shown in [42] that a BDT can be constructed in time proportional to the number of edges, which is $O(n^2)$. Since we assume $n \le p$, the $O(np^2)$ analysis cost dominates the procedure.

6  The  Throughput  Problem 

In computations where the input data rates must be maximized to handle real time constraints, the objective of the system is to achieve a high throughput. Typically, there is a limit on the amount of time the system can take to process a single data set, i.e., the response time. Under these conditions the objective of an assignment becomes maximization of the throughput subject to a specified response time requirement. We have referred to this problem as the throughput problem. In this section we show how solutions to the response-time problem can be used to solve the throughput problem. If one can solve the response-time problem for a given pipeline computation in $O(C(n,p))$ time, then one can solve its throughput problem in $O(np \log(pn) + \log(np)\,C(n,p))$ time.

Our  approach  depends  on  the  fact  that  minimal  response  times  behave  monotonically  with 
respect  to  the  throughput  constraint. 

Lemma 6.1 For any pipeline computation $\mathcal{P} = \langle K, T, F, G \rangle$, let $\rho(\lambda)$ be the minimal possible response time of $\mathcal{P}$, given throughput constraint $\lambda$. Then $\rho(\lambda)$ is a monotone non-decreasing function of $\lambda$.


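Proof: Recall that $u_\lambda(t_i)$ is the minimum number of processors required for task $t_i$ to meet throughput constraint $\lambda$. For every $t_i$, $u_\lambda(t_i)$ is clearly a monotone non-decreasing function of $\lambda$. Call an assignment $A$ $\lambda$-feasible if, for all $i = 1, \ldots, n$, it assigns at least $u_\lambda(t_i)$ processors to $t_i$. Finally, let $\mathcal{A}_\lambda$ be the set of all $\lambda$-feasible assignments. Whenever $\lambda_1 < \lambda_2$, we must have $\mathcal{A}_{\lambda_2} \subseteq \mathcal{A}_{\lambda_1}$ because of the monotonicity of each $u_\lambda(t_i)$. Since $\rho(\lambda)$ is the minimum cost among all assignments in $\mathcal{A}_\lambda$, and a minimum over a smaller set cannot be smaller, we have $\rho(\lambda_1) \le \rho(\lambda_2)$. ■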


This  result  can  be  viewed  as  a  generalization  of  Bokhari’s  graph-based  argument  for  monotonicity 
of  the  minimal  “sum”  cost,  given  a  “bottleneck”  cost  [5]. 

Suppose for a given pipeline computation we are able to solve for $\rho(\lambda)$, given any $\lambda$. The set of all possible throughput values is $\{ 1/f_i(k) \mid i = 1, \ldots, n;\ k = 1, \ldots, p \}$; $O(pn \log(pn))$ time is needed to sort them. Now suppose a response time constraint $\rho_{\max}$ is given. For any given throughput $\lambda$ we may compute $\rho(\lambda)$, and determine whether $\rho(\lambda) \le \rho_{\max}$. $\rho(\lambda)$ is monotone in $\lambda$, which permits us to perform a binary search over the sorted space of throughputs and identify the greatest one, say $\lambda^*$, for which $\rho(\lambda^*) \le \rho_{\max}$. The assignment associated with $\rho(\lambda^*)$ is the one maximizing throughput using $p$ processors, subject to response time constraint $\rho_{\max}$. If the cost of solving one response-time problem is $O(C(n,p))$, then the cost of solving the throughput problem is $O(pn \log(pn) + C(n,p) \log(pn))$.
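
The search can be sketched in Python as below. The function name `max_throughput` and the callable `min_response` (standing in for any solver of the response-time problem that returns the minimal response time for a given throughput constraint) are assumptions of this sketch; correctness relies only on `min_response` being monotone non-decreasing, as Lemma 6.1 states.

```python
def max_throughput(candidates, min_response, rho_max):
    """Largest candidate throughput whose minimal response time is <= rho_max.

    candidates:   iterable of possible throughput values.
    min_response: callable giving the minimal response time under a
                  throughput constraint (assumed monotone non-decreasing).
    Returns None if no candidate satisfies the response time bound.
    """
    lams = sorted(set(candidates))
    lo, hi = 0, len(lams) - 1
    best = None
    while lo <= hi:
        mid = (lo + hi) // 2
        if min_response(lams[mid]) <= rho_max:
            best = lams[mid]       # feasible: try a larger throughput
            lo = mid + 1
        else:
            hi = mid - 1           # infeasible: try a smaller throughput
    return best
```

Each probe costs one response-time solve, so the search contributes the $C(n,p)\log(pn)$ term, while sorting the candidates contributes the $pn\log(pn)$ term.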

Lemma 6.2 Let $\mathcal{P}$ be a pipeline computation, and suppose that the complexity of solving the response-time problem for $\mathcal{P}$ is $O(C(n,p))$. Then the complexity of solving the throughput problem for $\mathcal{P}$ is $O(pn \log(pn) + C(n,p) \log(pn))$.

When solving the response time problem, we typically compute an entire response time function, which essentially gives the "answer" (minimal response time) for a whole range of processors. When we solve the throughput problem in the manner just described, we compute a single answer, for a single processor count. If we desire a range of throughputs for a range of processors, we need to repeat the procedure above once for every processor count.


Figure 4: Computation Flow for Motion Estimation (a linear chain of nine tasks, $t_1$ through $t_9$)

The complexity of the algorithms for the throughput problem is seen to be higher, by a logarithmic factor, than that of the algorithms for the response time problem. For example, the complexity for serial task structures is seen to be $O(np^2 \log(np)) = O(np^2 \log p)$, which has increased by a logarithmic factor. Future endeavors include the pursuit of more efficient algorithms for the throughput problem.

7  An  Application 

In this section we illustrate our methods by considering an application requiring pipelined execution - a motion estimation system in computer vision. Motion estimation is an important problem in computer vision in which the goal is to characterize the motion of moving objects in a scene. From a computational point of view, continually generated images from a camera must be processed by a number of tasks. In order to process the images (data sets), throughput and response time constraints are imposed on the tasks and therefore, the appropriate model of computation is a pipeline computation. The application itself is described in detail in [11, 28]. It should be noted that there are many approaches to solving the motion estimation problem. We are only interested in an example, and therefore, the following algorithm is not presented as the only or the best way to perform motion estimation. A comprehensive digest of papers on the topic of motion understanding can be found in [31]. The following subsection briefly describes the underlying computations.

7.1  A  Motion  Estimation  System 

Figure 4 shows the task structure of our motion estimation system [11], which is a linear task structure. The data sets input to the task system are a continuous stream of stereo image pairs of a scene containing the moving vehicles. The required output is a list of 3-dimensional points (or features) that describe the motion at each time step.

The  system  consists  of  nine  major  tasks: 

1. Task $t_1$. The first task performs 2-D convolution on the input image pair. The convolution window size is an input parameter that is independent of the image size.

2. Task $t_2$. The second task extracts the zero crossings of the convolved image using a thresholding algorithm. Zero crossings represent edge features in the image.

3. Task $t_3$. The third task fits patterns to the edge features by using a template matching algorithm. There are 24 possible patterns that can be fit to an edge [21].

4. Task $t_4$. The fourth task performs a stereo match algorithm to match features from the left and right images of the same time frame [28]. To find a match in the right image for a feature in the left image, a weighted sum of the correlation coefficient and the directional difference weight is calculated between the feature in the left image and each of the features in the search space of the right image. The feature in the right image that has the maximum total weight is considered the matched feature. Details are provided in [28, 11].

5. Tasks $t_5$, $t_6$ and $t_7$. These are similar to $t_1$, $t_2$ and $t_3$ respectively, except that the algorithms are applied to stereo images separated in time by wider margins, depending on the desired accuracy for estimation.

6. Task $t_8$. This task performs a time match algorithm between matched features of the left image obtained from $t_4$ and features of the left image obtained from $t_7$. The time match process is similar to the stereo match process, except that the first stereo match guides the time match process and the search space for the time match algorithm is much larger.

7. Task $t_9$. Finally, the ninth task performs a second stereo match between the left and right images of the stereo images from later time frames. The output of $t_9$ is a set of 3-D feature points that describe the motion of an object between the two time frames.

All nine tasks are repeated for image inputs obtained continuously. In order to achieve real-time motion estimation at video frame rates, the entire process must be completed in 0.0333 seconds. The Image Understanding Benchmark [13] has a similar structure of computation flow: several tasks must be performed in a sequence in order to recognize an object in the scene and find the model that best describes the object.

7.2  Shared  and  Distributed  Multiprocessors 

All nine tasks were implemented on a distributed memory machine, the Intel iPSC/2 [7], and a shared memory machine, the Encore Multimax [15]. The Intel iPSC/2 is a circuit-switched hypercube multiprocessor. We used a 32 node iPSC/2 machine. Each node consists of an Intel 80386 processor and a floating point co-processor together with 4 Mbytes of RAM and a 64 Kbyte cache. The Encore Multimax 520 is a bus based system installed with eight dual processor cards. Each dual processor card incorporates two NS32532 processors, each with its own 256 Kbyte cache of fast static RAM. The machine has 128 Mbytes of shared memory.

7.3  Implementation  Results  for  Individual  Tasks 

We implemented the task system described above using outdoor images [11]. Several methods for implementing each algorithm (e.g., block partitioning, dynamic partitioning [11]) were used; for each task, we have selected the best performance numbers from these alternatives. The completion times for each algorithm were tabulated and are shown in Tables 3 and 4. Note that for each multiprocessor size, the completion times include all the overheads, computation time and communication time. Therefore, when selecting a partition of processors for a task, the corresponding response time will include all the overheads, computation time and communication times (including transferring data from one task to the next). The times in the tables are only shown for selected multiprocessor sizes, although individual tasks can be executed on an arbitrary number of processors. Since the sizes of the machines available to us were limited, for the purposes of illustration we extrapolated the completion times for larger machines as shown in the tables. Extrapolation was done using the immediate speedup available from the largest multiprocessor. For example, we computed the speedup (percentage improvement in response times) going from 16 to 32 processors for the Intel iPSC/2 and then reduced this number by five percent (the degradation in speedup in the range 8 to 32); the resulting number was taken as the speedup going from 32 to 64 processors. The portion of each response time table with times for 64, 128 and 256 processors was estimated in this manner. It should be noted that the absolute values of completion times have no impact on the execution of the assignment algorithms proposed. If individual completion times are different, the allocation may be different. The response time functions in both tables are found to be decreasing and convex.
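
The extrapolation step can be illustrated with a short sketch. It encodes one possible reading of the procedure, and both parts of that reading are assumptions: the speedup is taken as the ratio of completion times between successive machine sizes, and "reduced by five percent" is taken to mean multiplying that ratio by 0.95 at each doubling. The function name `extrapolate` is hypothetical. Applied to the measured Task 1 times on the iPSC/2 (7.07 s on 16 processors, 3.78 s on 32), this reading comes close to the extrapolated entries of Table 3 (2.12, 1.25 and 0.77) but does not reproduce them exactly.

```python
from typing import List


def extrapolate(t_prev: float, t_last: float, doublings: int,
                decay: float = 0.95) -> List[float]:
    """Extend a completion-time series past the largest measured machine.

    t_prev, t_last: measured times on the two largest machine sizes
                    (for example 16 and 32 processors).
    doublings:      number of further doublings to estimate.
    decay:          assumed factor by which the speedup shrinks per doubling.
    """
    speedup = t_prev / t_last      # measured speedup for the last doubling
    estimates = []
    t = t_last
    for _ in range(doublings):
        speedup *= decay           # assume the speedup degrades by five percent
        t = t / speedup            # estimated time on the next machine size
        estimates.append(t)
    return estimates


if __name__ == "__main__":
    # Task 1 on the iPSC/2: estimates for 64, 128 and 256 processors.
    for size, t in zip((64, 128, 256), extrapolate(7.07, 3.78, 3)):
        print(size, round(t, 2))
```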

A basic premise of our assignment algorithms is that we can measure response time functions of elemental tasks, then accurately compute the response time functions of aggregate tasks. The premise was validated on this application: the measured response time function for the entire system was found to deviate from the predicted response time function by no more than 5% at any processor count. This accuracy is largely due to the fact that the application is compute-bound: the computation-to-communication ratio is 100 to 1. Any errors introduced by our simplistic approach to communication costs are bound to be low. The accuracy is also due in part to the fact that all possible mappings of the pipeline were constructed to avoid shared communication channels: one can always embed a chain in a hypercube. Thus, no effects due to channel contention exist in the measurements. It remains to be seen how well our approach predicts response time functions on less compute-intensive applications. Nevertheless, applications of the type we consider here are practical and important.

7.4  Experimental  Results 
7.4.1  The  Response  Time  Problem 

The algorithm for serial tasks with convex response time functions (in Section 1) was run using Tables 3 and 4 for a range of desired throughput constraints. As an example of the output generated by the algorithm, Table 5 shows the processor assignment for individual tasks for various sizes of the Intel iPSC/2. The last row of the table also shows the minimum response time for the given throughput constraint ($\lambda = 0.05$ tasks/second). We observe that some throughput conditions cannot be met by all sizes of multiprocessors. For example, a throughput of 0.125 tasks/second cannot be achieved for a 32 or 64 processor machine but it can be achieved for a 128 or 256 processor machine, for which the minimum response time was observed to be 22.18 and 12.98 seconds respectively. Furthermore, the achieved throughput for a 128 processor machine was 0.157
Table 3: Completion times for individual tasks on the Intel iPSC/2 of various sizes (* indicates extrapolated values; ? marks an entry that could not be recovered)

Response Times for Individual Tasks (Sec.)

No. of
Proc.   Task 1  Task 2  Task 3  Task 4  Task 5  Task 6  Task 7  Task 8  Task 9
1       109.0   6.15    ?       24.67   109.0   6.15    ?       ?       ?
2       54.76   ?       ?       12.52   51.76   ?       ?       ?       9.15
4       27.51   1.58    0.081   6.32    27.51   1.58    0.081   34.22   4.58
8       13.88   0.81    0.042   3.22    13.88   ?       0.042   17.50   2.39
16      7.07    0.40    0.022   1.76    7.07    0.40    0.042   10.30   1.52
32      3.78    0.20    0.012   1.01    3.78    0.20    0.012   6.36    1.01
64*     2.12    0.11    0.007   0.61    2.12    0.11    0.007   4.13    0.71
128*    1.25    0.06    0.004   0.38    1.25    0.06    0.004   2.81    0.52
256*    0.77    0.04    0.002   0.26    0.77    0.04    0.002   ?       0.40

Table 4: Completion times for individual tasks on the Encore Multimax of various sizes (* indicates extrapolated values; ? marks an entry that could not be recovered)

Response Times for Individual Tasks (Sec.)

No. of
Proc.   Task 1  Task 2  Task 3  Task 4  Task 5  Task 6  Task 7  Task 8  Task 9
1       352.20  ?       ?       51.70   352.20  16.54   ?       212.00  25.50
2       176.08  ?       ?       28.00   176.08  8.33    ?       103.77  13.10
4       88.38   ?       ?       15.10   88.38   4.26    ?       51.70   7.10
8       45.42   2.11    ?       8.70    45.12   2.14    ?       25.98   4.25
16      26.99   1.23    0.20    5.00    26.99   1.23    ?       15.23   2.76
32*     16.84   0.74    0.13    3.01    16.84   0.74    ?       9.37    1.88
64*     11.03   0.47    0.09    1.91    11.03   0.47    ?       6.06    1.34
128*    7.59    0.31    0.06    1.27    7.59    0.31    ?       4.11    1.01
256*    5.48    0.22    0.05    0.89    5.48    0.22    ?       2.93    0.80
Table 5: An example processor allocation for minimizing response time for several sizes of iPSC/2 (MRT = Minimum Response Time, Specified Throughput = 0.05 tasks/sec., No. of processors allocated to individual tasks are shown)

        Multiprocessor Size (No. of Procs.)
Task    32               64               128              256
No.     Procs.  Time     Procs.  Time     Procs.  Time     Procs.  Time
                (Sec.)           (Sec.)           (Sec.)           (Sec.)
1       8       13.88    16      7.07     32      3.78     64      2.12
2       1       6.15     2       3.07     8       0.81     16      0.40
3       1       0.32     1       0.32     1       0.32     2       0.16
4       2       12.52    6       4.77     8       3.22     16      1.76
5       8       13.88    16      7.07     32      3.78     64      2.12
6       1       6.15     2       3.07     6       1.19     12      0.60
7       1       0.32     1       0.32     1       0.32     2       0.16
8       8       17.50    16      10.30    32      6.36     64      4.13
9       2       9.15     4       4.58     8       2.39     16      1.52
MRT             79.87            40.57            22.18            12.98

tasks/second and for a 256 processor machine the achieved throughput was 0.242 tasks/second.
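
The figures quoted above can be cross-checked directly from the allocations in Table 5. The sketch below is illustrative only: the dictionary `TABLE5` transcribes the table (processor counts cross-checked against the completion times in Table 3), and `summarize` is a hypothetical helper. It assumes that the response time of the linear chain is the sum of the task times and that the achieved throughput is the reciprocal of the slowest task time. Under these assumptions the allocations use exactly the available processors, the sums agree with the MRT row to within rounding, and the throughputs for 128 and 256 processors come out as 0.157 and 0.242 tasks/second.

```python
from typing import Dict, List, Tuple

# Machine size -> (processors, time in seconds) for tasks 1..9, from Table 5.
TABLE5: Dict[int, List[Tuple[int, float]]] = {
    32:  [(8, 13.88), (1, 6.15), (1, 0.32), (2, 12.52), (8, 13.88),
          (1, 6.15), (1, 0.32), (8, 17.50), (2, 9.15)],
    64:  [(16, 7.07), (2, 3.07), (1, 0.32), (6, 4.77), (16, 7.07),
          (2, 3.07), (1, 0.32), (16, 10.30), (4, 4.58)],
    128: [(32, 3.78), (8, 0.81), (1, 0.32), (8, 3.22), (32, 3.78),
          (6, 1.19), (1, 0.32), (32, 6.36), (8, 2.39)],
    256: [(64, 2.12), (16, 0.40), (2, 0.16), (16, 1.76), (64, 2.12),
          (12, 0.60), (2, 0.16), (64, 4.13), (16, 1.52)],
}

# Minimum response times reported in the last row of Table 5.
MRT = {32: 79.87, 64: 40.57, 128: 22.18, 256: 12.98}


def summarize(alloc: List[Tuple[int, float]]) -> Tuple[int, float, float]:
    """Return (processors used, response time, achieved throughput)."""
    procs = sum(k for k, _ in alloc)
    response = sum(t for _, t in alloc)            # chain: task times add up
    throughput = 1.0 / max(t for _, t in alloc)    # assumed: slowest task limits rate
    return procs, response, throughput


if __name__ == "__main__":
    for size, alloc in TABLE5.items():
        procs, response, throughput = summarize(alloc)
        print(size, procs, round(response, 2), round(throughput, 3))
```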

Figure 5 shows the optimal response time function for the entire pipeline computation together with the achieved throughput using the hypercube data. As we might expect, the response time function is decreasing and the achieved throughput is increasing. Figure 6 shows response times for a specified throughput of $\lambda = 0.05$ tasks/second for different hypercube sizes. Along with the response time function from Figure 5, two curves are shown to provide a comparison with non-optimal, yet simple, heuristics for processor assignment. The first heuristic, called the equal allocation heuristic, allocates an equal number of processors to each task, thus ignoring the response time functions of the individual tasks (this takes $O(n)$ time). The second heuristic, called the ratio heuristic, attempts to take these functions into account through the use of ratios: initially each task is assigned one processor, and the remaining processors are distributed in proportion to the quantities $f_i(1)$, $1 \le i \le n$, for each of the $n$ tasks (requiring $O(n)$ time). Our optimal algorithm ($O(n \log p)$) always achieves a lower response time than the two simple $O(n)$ heuristics. Comparing the achieved throughputs in Figure 7, it can be observed that the ratio heuristic achieves higher throughput than the optimal algorithm because it does not trade off throughput to achieve the minimum response time, i.e., the heuristic is not guaranteed to satisfy the response-time constraint. The equal allocation strategy performs rather poorly, as one might expect.
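
The two baseline heuristics can be sketched in a few lines. The function names and the sample single-processor times are hypothetical, and the text does not say how fractional shares are rounded in the ratio heuristic; the sketch assumes rounding down, with any processors left over going to the tasks with the largest fractional remainders. The equal allocation is taken to be the integer part of $p/n$ for every task.

```python
from typing import List, Sequence


def equal_allocation(n: int, p: int) -> List[int]:
    """Equal allocation heuristic: every task gets floor(p / n) processors."""
    if n <= 0 or p < n:
        raise ValueError("need at least one processor per task")
    return [p // n] * n


def ratio_allocation(f1: Sequence[float], p: int) -> List[int]:
    """Ratio heuristic: one processor per task, then the remaining processors
    are split in proportion to the single-processor times f_i(1)."""
    n = len(f1)
    if n == 0 or p < n:
        raise ValueError("need at least one processor per task")
    spare = p - n
    total = sum(f1)
    shares = [spare * t / total for t in f1]       # ideal fractional shares
    alloc = [1 + int(s) for s in shares]           # round each share down
    # Assumption: hand out what rounding left over, largest remainder first.
    leftover = p - sum(alloc)
    order = sorted(range(n), key=lambda i: shares[i] - int(shares[i]), reverse=True)
    for i in order[:leftover]:
        alloc[i] += 1
    return alloc


if __name__ == "__main__":
    # Hypothetical single-processor times for three tasks, 13 processors.
    print(equal_allocation(3, 13))
    print(ratio_allocation([6.0, 3.0, 1.0], 13))
```

Neither routine looks at the response time functions beyond $f_i(1)$, which is why neither can guarantee that a specified constraint is met.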

The tradeoff of response time versus throughput constraint (using optimal response time functions) is studied in Figures 8 and 9 for a 128- and 256-processor hypercube. Figure 8 shows the response time and Figure 9 shows the corresponding achieved throughput as a function of the specified throughput. As we can observe, the response time curve follows the throughput curve
Figure 5: Response Time Problem: Response Time and Achieved Throughput

Figure 6: Response Time Problem: Comparison with heuristics (comparison of response times for specified throughput = 0.05; horizontal axis: number of processors)

Figure 7: Response Time Problem: Achieved throughputs for heuristics

Figure 8: Response Time Problem: Response time with increasing throughput constraint (comparison of response times for 128 and 256 processor hypercubes)

Figure 9: Response Time Problem: Achieved throughput with increasing throughput constraint (comparison of achieved throughputs for 128 and 256 processor hypercubes; horizontal axis: specified throughput)

Figure 10: Response Time Problem: Results for Encore (response time and achieved throughput for specified throughput = 0.0125; horizontal axis: no. of processors)

Figure 11: Throughput Problem: Throughputs and achieved response times

in shape; this clearly indicates that the algorithm trades off response time to achieve the specified throughput. This is exemplified at high throughput constraints, where the minimum response time increases significantly in order to achieve the specified throughput. For low values of specified throughput, the change in minimum response time is insignificant because the throughput can be achieved easily with the given number of processors. For a larger system the knee of the curves shifts to the right, as expected, due to the additional resources (as shown for a 256-processor system). Finally, Figure 10 plots the response time as a function of the number of processors for the Encore data. The graph is seen to closely resemble Figure 5. To avoid repetition, we do not show further results for the Encore.

7.4.2  The  Throughput  Problem 

Figure 11 illustrates the maximum throughput obtained and the corresponding achieved response time for our task system when the specified response time is $\rho = 100$ seconds. The results generated by the two heuristics described earlier are presented in Figure 12. The optimal algorithm generates higher throughputs than those achieved by the two heuristics. Figure 13 shows the achieved response times when using the heuristics. The ratio heuristic achieves a lower response time than the optimal algorithm because it does not necessarily satisfy the throughput constraint.

The tradeoff between response time and throughput is shown once again, this time in the context of the throughput problem, in Figures 14 and 15 for 128 and 256 processor hypercubes as a function of the specified response time. The solid line shows the maximum possible throughput when there is no response time constraint. Therefore, for any specified response time, the difference between the maximum throughput and the unconstrained maximum throughput represents the amount of throughput traded off to achieve the specified response time. Furthermore, we can observe that as the specified response time increases, the difference between the unconstrained maximum
Figure 12: Throughput Problem: Throughputs obtained by heuristics (comparison of maximum throughput; curves: ratio heuristic, equal allocation)

Figure 13: Throughput Problem: Achieved response times for heuristics

Figure 14: Throughput Problem: Maximum throughput with increasing response time constraint (comparison of throughputs for 128 and 256 processor hypercubes; horizontal axis: specified response time)

throughput and the maximum throughput decreases because of the weakening of the response time constraint. Beyond a certain point, the response time constraint is so weakened that the maximum unconstrained throughput is achieved, as shown by the plateau in the throughput curve. This phenomenon is also observed in functional pipelines in processor designs, where inserting delays in the pipeline stages results in higher throughput at the cost of response time [26, 34, 40].

8  Summary 

In this paper we have formulated the problem of optimizing the performance of a pipeline computation, represented by a task structure, on a parallel architecture, given a large supply of processors and the experimentally determined response time functions for its constituent tasks. Unlike prior treatments of the mapping problem, we considered the case where there are many more processors than tasks and where tasks are not queued or scheduled. We considered the dual problems of minimizing response time subject to a throughput constraint, and maximizing throughput subject to a response time constraint. As we observed in our sample application, these problems are complementary, in the sense that allocation to increase throughput may have the side effect of increasing response time, and vice versa.

The problem posed in this paper was shown to be solvable in polynomial time for a useful class of task structures. Specifically, we presented $O(np^2)$ algorithms (where $n$ is the number of tasks and $p$ is the number of processors) for the response time problem, for the cases where the task structures are linear, tree-structured and series-parallel graphs. The algorithms designed for the response time problem can be used to solve the throughput problem with an additional logarithmic factor in complexity. To place the work in a realistic setting we considered an application, stereo image matching on two parallel architectures, and evaluated the performance of our assignment
Figure 15: Throughput Problem: Achieved response times with increasing response time constraint (comparison of achieved response times for 128 and 256 processor hypercubes; curves: P=128, P=256)

algorithms.  Future  endeavors  include  the  provision  of  algorithms  for  general  task  structures  and 
investigation  of  faster  and  parallelized  assignment  algorithms. 